TCR-L: an analysis tool for evaluating the association between the T-cell receptor repertoire and clinical phenotypes

Background T cell receptors (TCRs) play critical roles in adaptive immune responses, and recent advances in genome technology have made it possible to examine the T cell receptor (TCR) repertoire at the individual sequence level. The analysis of the TCR repertoire with respect to clinical phenotypes can yield novel insights into the etiology and progression of immune-mediated diseases. However, methods for association analysis of the TCR repertoire have not been well developed. Methods We introduce an analysis tool, TCR-L, for evaluating the association between the TCR repertoire and disease outcomes. Our approach is developed under a mixed effect modeling, where the fixed effect represents features that can be explicitly extracted from TCR sequences while the random effect represents features that are hidden in TCR sequences and are difficult to be extracted. Statistical tests are developed to examine the two types of effects independently, and then the p values are combined. Results Simulation studies demonstrate that (1) the proposed approach can control the type I error well; and (2) the power of the proposed approach is greater than approaches that consider fixed effect only or random effect only. The analysis of real data from a skin cutaneous melanoma study identifies an association between the TCR repertoire and the short/long-term survival of patients. Conclusion The TCR-L can accommodate features that can be extracted as well as features that are hidden in TCR sequences. TCR-L provides a powerful approach for identifying association between TCR repertoire and disease outcomes.


Demonstration of the TCRhom computation
Supplementary  Table S1 shows a simple data structure which contains two subjects' repertoires R 1 and R 2 . The first subject's repertoire R 1 has three unique amino acid sequences a 1,1 = CASSYSVGTDTQYF, a 1,2 = CASSDGGGNEQFF, and a 1,3 = CASSLIRNGYT. The corresponding abundances are w 1,1 = 1, w 1,2 = 4, and w 1,3 = 2. The second subject's repertoire R 2 has two unique amino acid sequences a 2,1 = CASSELLTGYTF and a 2,2 = CASSYGFRWGGEQYF with the abundances w 2,1 = 3 and w 2,2 = 1, respectively. For the computation of s(a 1,k , a 2,l ), k = 1, 2, 3, l = 1, 2, we use the BLOSUM62 substitution matrix with the default affine gap penalty for pairwise alignments via the Needleman-Wunsch algorithm. The affine gap penalty is referred to as gap opening + gap extension × length of gaps, where the gap opening is the cost of creating a gap (−) and the gap extension is the cost of extending the gap.
Then the TCRhom S 1,2 is computed as where f (R i ) is a p × 1 vector of the extracted features of the ith subject's TCR repertoire R i and W = [W 1 W 2 · · · W r ] is a p × r matrix whose W l , l = 1, 2, . . . , r, is a p × 1 vector of amino acid characteristic corresponding to the p amino acids of the feature vector. We adapt the proof in the MiST (Sun et al., 2013) framework to prove the independence between U τ 2 and U η .
Let A = I − ΩX(X ΩX) −1 X and B = I − ΩZ(Z ΩZ) −1 Z , where Ω is a n × n diagonal variance matrix of y and I is a n × n identity matrix. By Taylor's expansion, we can approximate We consider the following eigen-decomposition of S: S = m k=1 λ k u k u k , where λ 1 ≥ · · · ≥ λ m are the m non-negative eigenvalues of S and u k = [u k1 u k2 · · · u kn ] is the n × 1 corresponding Hence, we get ij Ω jj = 0 for j = 1, 2, . . . , n.

Additional simulation studies
For each of the sample sizes (n = 350, 400, 450), 1, 000 simulations were performed. Two confounding variables were generated: for i = 1, 2, . . . , n, X i1 followed a Bernoulli distribution with the probability of 0.5 and X i2 followed a standard normal distribution. We simulated two characteristics W 1 and W 2 as follows: W 1 was a 20 × 1 vector whose the elements were uniformly distributed from −3 through 3 and W 2 was a 20 × 1 vector whose the elements were normally distributed with a mean of 0 and a variance of 4. We used a 20 × 1 vector for f (R i ), i = 1, 2, . . . , n,.
We considered the fixed effects followed a multivariate normal distribution with a mean vector of 0s and a variance-covariance matrix τ 2 S, where S was specified based on either BLOSUM62 or PAM250.
Here, both the fixed and random effects contributed to the trait. The variance-covariance matrix of the random effect, S, was specified based on the PAM250 in Table S2, and hence, the power for the TCRL-P250 was the highest at each sample size. A comparable power was observed in the TCRL-B62 in Table S2, indicating that the TCR-L approaches are fairly robust to the choice of substitution matrix for S. This finding is consistent in Table S3 when S was specified based on BLOSUM62. Table S3 shows that the TCRL-B62 has the highest power and the TCRL-P250 has a comparable power. Alternatively, we also evaluated the power when only the fixed effect was considered. In this case, we set

More simulations based on real data
In this section, we conducted simulation studies based on the skin cutaneous melanoma (SKCM) dataset in TCGA. The data have been described in the Real Data Analysis section in the main text.
We generated two confounding variables X 1 and X 2 to emulate gender and age in the real data.
In detail, the X 1 was generated from a Bernoulli distribution with the probability parameter being 0.66, where 0.66 is the proportion of male subjects in the SKCM data. To mimic the distribution of the age in the real data, we generated X 2 from a normal distribution with mean 56 and standard deviation 16. The age observed from the SKCM data was between 15 and 86; thus, if the simulated age was out of this range, we re-generated X 2 from the normal distribution until X 2 falls into the range. For the parameter setup, we observed from the real data analysis that the regression coefficients for intercept, gender and age wereβ 0 = −2.9,β 1 = 0.23, andβ 2 = 0.034, respectively, thus we used these values for the parameters in our simulations.
To mimic the S matrix in the real data, we directly sampled the TCR repertoire from the subjects in the SKCM data. In other words, the TCR repertoire R i was not generated through the simulation scheme described in the main text, but from the real TCR repertoire of the SKCM patients. After removing subjects with a single amino acid sequence, we had 349 subjects in the SKCM dataset. Then, we sampled R i for i = 1, 2, . . . , n from the TCR repertoires of the 349 subjects without replacement for each simulation replicate. Next, the weighted hydrophobic score for the ith subject, W T f (R i ), and the TCRhom matrix, S, were computed accordingly. Once the S was computed, the random effects [h(R 1 ), . . . , h(R n )] T were generated from N (0, τ 2 S).
We considered three scenarios for power comparison: the model contained only fixed effect, the model contained only random effect, and the model contained both fixed and random effects. These three scenarios were denoted as SI, SII, and SIII, respectively. The parameter settings for SI, SII, and SIII are given in Table S5. For SII and SIII, depending on whether the S matrix was derived from BLOSUM62 or PAM250, the data simulation schemes were named as SII-B62, SIII-B62, or SII-P250, SIII-P250. Supplementary We considered n = 250 and 300, and simulated the data such that the sample size per group was no less than 10% of n. We replicated 1,000 times and evaluated the power at the significance level of α = 0.05. The results of the power comparison were summarized in Tables S6 -S9. These results showed that the TCRL had a robust performance in SI and SII and was the most competitive approach in SIII. The observed pattern of power was similar to that of the main text. Supplementary